Add 'From Zero to Zarr' beginner guide to the Zarr data model #4077
chuckwondo wants to merge 3 commits into
Adds a new user-guide page (docs/user-guide/data_model.md, nav label "Understanding Zarr") that explains the Zarr data model for newcomers: why Zarr exists (its parallel-computing origin in genomics), then arrays, chunking and the chunk grid, stores as key->bytes maps, metadata (zarr.json), the specification, codecs, sharding, groups, and N-D arrays, ending with a runnable round-trip example and a cross-language note.

Prose + diagrams throughout, with executable, build-verified code in the final section, and every spec detail linked to its section of the Zarr v3 spec.

Enables Mermaid diagrams via a pymdownx.superfences custom fence, and adds the page to the User Guide nav.

Closes zarr-developers#4056

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| Chunking is the key move. Each chunk can be stored, loaded, and compressed on its | ||
| own, so a program can read just the chunks it needs — that one corner your | ||
| colleague wanted — without touching the rest. (Starting with a chunk shape that |
Could we use an inline admonition for the partial-chunk callout? Something like:
Note
If each chunk has a fixed size, how can we use chunks to represent an array that isn't evenly divided by the chunk size? See #section for the answer to that question!
Not sure if "note" is the right admonition here.
@mkitti if you have time it would be good to get your thoughts on this
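For readers of this thread, the "read just the chunks it needs" claim can be sketched in a few lines. This is only an illustration: `chunks_for_selection` is a hypothetical helper, not part of zarr-python, and the 4×6 array chunked at `(2, 3)` is taken from the guide's running example.

```python
import itertools


def chunks_for_selection(selection, chunk_shape):
    """Return the chunk-grid coordinates that a rectangular selection touches.

    `selection` holds one (start, stop) pair per axis, with `stop` exclusive.
    """
    per_axis = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size       # chunk holding the first selected index
        last = (stop - 1) // size   # chunk holding the last selected index
        per_axis.append(range(first, last + 1))
    return list(itertools.product(*per_axis))


# Top-right corner of a 4x6 array chunked at (2, 3): rows 0-1, columns 4-5.
corner = chunks_for_selection([(0, 2), (4, 6)], chunk_shape=(2, 3))

# A selection that straddles chunk boundaries touches several chunks.
middle = chunks_for_selection([(1, 3), (2, 4)], chunk_shape=(2, 3))
```

The corner selection maps to a single chunk, while the straddling selection needs all four, which is the trade-off the chunk shape controls.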
| G11 --> K11 | ||
| ``` | ||
|
| Where does a key like `c/0/1` come from? It's built by a simple, fixed rule (the |
"fixed rule" locally implies that arrays have only one chunk key encoding. Maybe we can rephrase to make it clear that the rule is defined by a particular field in the array metadata.
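To illustrate the point, here is a minimal sketch of how the key depends on the `chunk_key_encoding` field of the array metadata. The function name `chunk_key` is hypothetical, and the behaviour is my reading of the `default` and `v2` encodings in the Zarr v3 spec rather than zarr-python's implementation.

```python
def chunk_key(coords, encoding):
    """Build a store key for a chunk from its grid coordinates.

    `encoding` mimics the `chunk_key_encoding` metadata field:
    {"name": ..., "configuration": {"separator": ...}}.
    """
    name = encoding["name"]
    config = encoding.get("configuration", {})
    if name == "default":
        sep = config.get("separator", "/")
        # Keys are prefixed with "c"; a zero-dimensional array yields just "c".
        return sep.join(["c", *map(str, coords)])
    if name == "v2":
        sep = config.get("separator", ".")
        # No prefix; a zero-dimensional array yields "0".
        return sep.join(map(str, coords)) if coords else "0"
    raise ValueError(f"unknown chunk key encoding: {name!r}")


default_key = chunk_key((0, 1), {"name": "default"})
dotted_key = chunk_key((0, 1), {"name": "default", "configuration": {"separator": "."}})
v2_key = chunk_key((0, 1), {"name": "v2"})
```

The same chunk coordinates give `c/0/1`, `c.0.1`, or `0.1` depending on the metadata, so the rule is fixed per array rather than fixed for all arrays.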
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4077   +/-   ##
=======================================
  Coverage   93.50%   93.50%
=======================================
  Files          90       90
  Lines       11979    11979
=======================================
  Hits        11201    11201
  Misses        778      778
maxrjones left a comment
This is awesome @chuckwondo! I just have some small nits
| the *how* one idea at a time, until you understand **how Zarr stores an array**, | ||
| **why** that layout is defined by a written specification, and **how a library | ||
| turns those stored bytes back into an array you can use**. |
| the *how* one idea at a time, until you understand **how Zarr stores an array**, | |
| **why** that layout is defined by a written specification, and **how a library | |
| turns those stored bytes back into an array you can use**. | |
| the *how* one idea at a time, until you understand **how Zarr stores an array**, | |
| **why that layout is defined by a written specification**, and **how a library | |
| turns those stored bytes back into an array you can use**. |
nit about consistent use of bold text
|
| --- | ||
|
| ## Why we need Zarr |
If possible, it would be nice to have a tl;dr (maybe a note admonition) at the top of this section
| extraordinary firehoses of numbers. A satellite streams images of the Earth; a | ||
| microscope captures gigapixel scans; a gene sequencer reads thousands of genomes; | ||
| a climate model writes out temperature and wind for every point on the globe, hour | ||
| after hour. In each case the result has the same shape: a vast grid of numbers — |
Many people have a negative reaction to em-dashes since their proliferation by AI. It would likely be worth reducing their use in this guide by writing more, shorter sentences.
| why, it helps to understand two things the array formats of the day were already | ||
| doing. | ||
|
| First, **chunking**. To store an array bigger than memory, formats like HDF5 and |
It might be helpful to link to a glossary via hovertools (e.g., the approach in https://github.com/developmentseed/datacube-guide/pull/35/changes#diff-98d0f806abc9af24e6a7c545d3d77e8f9ad57643e27211d7a7b896113e420ed2).
| (the [*Anopheles gambiae* 1000 Genomes Project](https://www.malariagen.net/)) — | ||
| arrays far too big to fit in memory. His real frustration was *speed*, and to see | ||
| why, it helps to understand two things the array formats of the day were already | ||
| doing. |
| doing. | |
| doing: chunking and compression. |
If I'm reading this right, it's not totally obvious what "Second" is
|
| So a 5×6 array chunked at `(2, 3)` quietly stores a row of "phantom" cells holding | ||
| the fill value. It's harmless, but it's a small waste — and a good reason to pick a | ||
| chunk shape that fits your array's real shape reasonably well. (For practical |
| chunk shape that fits your array's real shape reasonably well. (For practical | |
| chunk shape that fits your array's real shape reasonably well and lean on the | |
| [rectilinear chunk grid extension](https://github.com/zarr-developers/zarr-extensions/tree/main/chunk-grids/rectilinear) when needed. (For practical |
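The "phantom" row in the 5×6 example can be checked with a little arithmetic. A minimal sketch, assuming a regular chunk grid where edge chunks are padded to full size; `chunk_grid` and `padded_cells` are hypothetical helper names.

```python
import math


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each axis (ceiling division per axis)."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunk_shape))


def padded_cells(shape, chunk_shape):
    """Cells stored beyond the array's real shape when edge chunks are full size."""
    grid = chunk_grid(shape, chunk_shape)
    stored = math.prod(g * c for g, c in zip(grid, chunk_shape))
    return stored - math.prod(shape)


grid_5x6 = chunk_grid((5, 6), (2, 3))       # 3 chunks down, 2 across
extra_5x6 = padded_cells((5, 6), (2, 3))    # one extra row of 6 cells
extra_4x6 = padded_cells((4, 6), (2, 3))    # chunk shape divides evenly
```

A 5×6 array needs a 3×2 grid of chunks, which covers 6×6 cells, so six cells hold only the fill value. The 4×6 array from the earlier sections has none.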
| [specification](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#codecs) | ||
| defines three kinds of codec, applied in this order: | ||
|
| 1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a |
| 1. **array → array** codecs (optional, any number) — rearrange the values; e.g. a | |
| 1. **array → array** codecs (optional, any number) — rearrange or change the values; e.g. a |
I believe this change is more accurate, but would appreciate if @d-v-b confirms
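To make the three stages concrete, here is a standard-library sketch of a pipeline in that order. It is an illustration under assumptions: the delta step is a stand-in for an array → array codec that changes values (if the suggested wording is confirmed), `struct` packing stands in for the array → bytes codec, and `zlib` for a bytes → bytes codec. None of these function names come from zarr-python.

```python
import struct
import zlib


def delta_encode(values):
    """array -> array: keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def encode_chunk(values):
    deltas = delta_encode(values)                        # array -> array
    raw = struct.pack(f"<{len(deltas)}q", *deltas)       # array -> bytes (little-endian int64)
    return zlib.compress(raw)                            # bytes -> bytes


def decode_chunk(blob):
    raw = zlib.decompress(blob)                          # undo bytes -> bytes
    deltas = list(struct.unpack(f"<{len(raw) // 8}q", raw))  # undo array -> bytes
    return delta_decode(deltas)                          # undo array -> array


chunk = list(range(100, 110))
restored = decode_chunk(encode_chunk(chunk))
```

Decoding applies the inverse steps in reverse order, which is why the order of the three codec kinds matters.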
| simple, but it has a limit: small chunks in a very large array produce a *huge* | ||
| number of chunks, and therefore a huge number of files or objects. The spec notes | ||
| this is exactly where file systems (block sizes, inode limits) and object stores | ||
| (which dislike millions of tiny objects) start to struggle. |
I think the more prevalent limitation on object stores is the cost model, where the cost of operations often scales with the number of objects
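A rough count shows why object numbers matter, whichever limit bites first. This sketch assumes one stored object per chunk without sharding and one per shard with it. The array, chunk, and shard shapes are made-up illustrative values, and `object_count` is a hypothetical helper.

```python
import math


def object_count(shape, per_object_shape):
    """Stored objects needed when each object covers `per_object_shape` cells."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, per_object_shape))


shape = (100_000, 100_000)                          # hypothetical large 2-D array

unsharded = object_count(shape, (100, 100))         # one object per small chunk
sharded = object_count(shape, (10_000, 10_000))     # one object per shard
chunks_per_shard = object_count((10_000, 10_000), (100, 100))
```

With these numbers the unsharded layout produces a million objects, while sharding packs 10,000 chunks into each of 100 objects. Any per-object or per-request cost scales with the first figure rather than the second.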
| or more axes. | ||
|
| To see the generalisation concretely, picture a 3-D array as a **stack of 2-D | ||
| arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array — |
| arrays**. Here are two copies of our 4×6 grid stacked into a `(2, 4, 6)` array — | |
| arrays**. Here are two versions of our 4×6 grid stacked into a `(2, 4, 6)` array — |
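The stacking picture can be reproduced with nested lists. A minimal sketch: the second "version" is just the first plane plus 100, and the 3-D chunk shape `(1, 2, 3)` is an assumption made for illustration, not something the guide specifies.

```python
# One 4x6 plane, numbered 0..23 in row-major order.
plane = [[row * 6 + col for col in range(6)] for row in range(4)]

# Two versions of the plane stacked along a new leading axis.
stack = [plane, [[value + 100 for value in row] for row in plane]]
shape = (len(stack), len(stack[0]), len(stack[0][0]))


def locate(index, chunk_shape):
    """Split an element index into (chunk coordinates, offset inside that chunk)."""
    chunk = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk, offset


# Assumed chunk shape: one plane deep, 2 rows, 3 columns.
chunk, offset = locate((1, 3, 5), chunk_shape=(1, 2, 3))
value = stack[1][3][5]
```

The same per-axis integer division used for 2-D arrays works unchanged with a third axis. Only the length of the index tuple grows.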
| - write it to the corresponding slice of the array, | ||
| - discard it, and move on to the next block. | ||
|
| Because only one block is ever in memory, the array on disk can be far larger than |
| Because only one block is ever in memory, the array on disk can be far larger than | |
| Because the minimum amount of data ever needed in memory to be useful is a single block, the array on disk can be far larger than |
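The block-at-a-time loop can be sketched without any Zarr dependency. This is only an analogy: a flat binary file stands in for the array on disk, and `produce_block`, the sizes, and the file name are hypothetical.

```python
import array
import os
import tempfile

ROWS, COLS, BLOCK_ROWS = 10, 4, 3   # hypothetical sizes: 10x4 array, 3 rows per block


def produce_block(start, stop):
    """Hypothetical data source: compute rows [start, stop) on demand."""
    return array.array(
        "d", (float(r * COLS + c) for r in range(start, stop) for c in range(COLS))
    )


def write_in_blocks(path):
    """Write the whole array one block at a time; return the largest block held."""
    largest = 0
    with open(path, "wb") as f:
        for start in range(0, ROWS, BLOCK_ROWS):
            stop = min(start + BLOCK_ROWS, ROWS)
            block = produce_block(start, stop)          # produce one block
            f.seek(start * COLS * block.itemsize)       # the corresponding slice
            f.write(block.tobytes())                    # write it
            largest = max(largest, len(block))
            del block                                   # discard it, move on
    return largest


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array.bin")
    largest_block = write_in_blocks(path)
    file_size = os.path.getsize(path)
```

The file ends up holding all 40 values, yet no more than 12 were ever in memory at once. A real Zarr workflow would assign each block to a slice of the array instead of seeking in a file.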
TODO:
- docs/user-guide/*.md
- changes/